1. Introduction

This notebook contains a differential gene expression analysis and a preliminary thresholded over-representation analysis for a selected dataset from the GEO database. The dataset in particular is GSE157194, which is associated with a publication titled “Atopic dermatitis displays stable and dynamic skin transcriptome signatures” (Möbus et al., 2021).

The goal of the study was to understand the effect of various treatments for atopic dermatitis by conducting a gene expression study using mRNA-Seq data taken from biopsies of the affected and non-affected skin of human patients. The dataset contains data for 57 patients, with samples of affected (intrapersonal lesional) and non-affected (non-lesional) skin, along with follow-ups after three months of treatment (with either dupilumab or cyclosporine). An important note is that the dataset is “incomplete”/has several prominent issues:

No information regarding these missing features were found in the associated paper.

After downloading the raw count matrix from GEO via GEOquery (Davis & Meltzer, 2007), the data was cleaned (confirming that there were no duplicate genes), filtered by gene (removing those without at least one read per million in each sample), and filtered by patient (removing those with incomplete data; see the points above). The genes (rows) were labelled by their Ensembl gene IDs, which were mapped to HUGO gene symbols using biomaRt (Durinck et al., 2009), and genes that were unable to be converted (such as novel genes) were removed. To normalize the data, trimmed mean of m-values/TMM was used (Robinson et al., 2010), since we followed the assumption that most genes are not differentially expressed. Finally, the data was checked for outliers (none were found), and the resulting data frame was written to a tab-delimited txt file (and gzipped for upload to GitHub).

We can load the dataset here and take a look at the first few rows and columns:

# The dataset should originally be compressed (GitHub 25 MB)
filepath <- file.path(getwd(), "normalizedCountData.txt")
if (!file.exists(filepath)) {
  gzfilepath <- file.path(getwd(), "normalizedCountData.txt.gz")
  if (!file.exists(gzfilepath)) {
    stop("Dataset not found")
  }
  GEOquery::gunzip(gzfilepath)
}

normalizedCountData <- read.table(file = file.path(getwd(), "normalizedCountData.txt"),
                                  sep = "\t",
                                  header = TRUE,
                                  stringsAsFactors = FALSE,
                                  check.names = FALSE)
knitr::kable(normalizedCountData[1:4, 1:4], type = "html")
Patient_1_AN_m0 Patient_1_AL_m0 Patient_3_AN_m0 Patient_3_AL_m0
A1BG-AS1 1.216988 1.0737920 1.901915 2.7945041
A2M 213.360038 263.8843886 386.515720 303.5713893
A2M-AS1 1.880799 0.9203932 2.057173 0.6618562
A2ML1 131.268699 143.3512345 156.034665 127.2234747

Number of patients: 24
Number of samples: 96
Number of genes: 17747

The column names are formatted “Patient_<patient number>_<AN|AL>_<m0|m3>“, where AN means intrapersonal non-lesional, AL means intrapersonal lesional, m0 means the sample was taken before treatment, and m3 means the sample was taken after three months of treatment. Note the absence of patient 2 between patients 1 and 3, who was removed due to missing data.

2. Differential gene expression

The corresponding journal entry for this assignment can be found here.

First, we should inspect the data as a heatmap to get a general idea of gene expression levels. But before we do that, we should row-normalize it, which scales rows and centres them around the mean, so that no particular gene completely dominates the plot.

# Parameters to select the range of samples to use. The matrix has 17,747 rows and 96 columns, and if too much data is
# chosen, the code will crash
SAMPLE_START <- 1
# Setting SAMPLE_END to 12 or more has been tested to crash
SAMPLE_END <- 8
heatmapMatrix <- normalizedCountData[, SAMPLE_START:SAMPLE_END]

# Row-normalization can be done with the base scale function on the transposed matrix
heatmapMatrix <- t(scale(t(heatmapMatrix)))
heatmapMatrix[is.na(heatmapMatrix)] <- 0

if (min(heatmapMatrix) == 0) {
  # No negative values => colours should only range from white to red
  heatmapColours <- circlize::colorRamp2(c(0, max(heatmapMatrix)),
                                         c("white", "red"))
} else {
  # Has negative values => use blue for those
  heatmapColours <- circlize::colorRamp2(c(min(heatmapMatrix), 0, max(heatmapMatrix)),
                                         c("blue", "white", "red"))
}

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = FALSE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Normalized heatmap of samples")

# Displaying the heatmap may take a while to run, or just crash
currHeatmap

By default (and only if the last line in the code snippet directly above is not commented out), the heatmap will be displayed for eight samples (four patients, both before treatment). The parameters at the start of the code snippet can be changed to display different parts of the entire heatmap. Regardless, we can already see that there is significant variation in the data between different patients. This will need to be taken into account when doing further analysis.

Testing a single gene

The authors of the paper hypothesized that IL13 (interleukin 13, a gene related to type 2 cytokines and chemokines), among others, were responsible for the lesions found in patients with AD, so we can run a quick test to see if there is any change in expression for this particular gene among the entire set of patients.

# Get the columns corresponding to pre- and post-treatment affected samples
preTreatAffSamples <- grep(colnames(normalizedCountData), pattern = "_AL_m0")
postTreatAffSamples <- grep(colnames(normalizedCountData), pattern = "_AL_m3")

# Find the row index of IL13 in the data
il13Idx <- which(rownames(normalizedCountData) == "IL13")

# Format the expression levels (for both pre- and post-treatment) in a table for comparison
il13Expr <- t(normalizedCountData[il13Idx, preTreatAffSamples])
il13Expr <- as.data.frame(il13Expr)
il13Expr <- cbind(il13Expr, t(normalizedCountData[il13Idx, postTreatAffSamples]))
colnames(il13Expr) <- c("Pre-treatment", "Post-treatment")

# A visual inspection of some of the data may be interesting
knitr::kable(il13Expr[1:8, ], type = "html")
Pre-treatment Post-treatment
Patient_1_AL_m0 0.2684480 0.2614197
Patient_3_AL_m0 0.2206187 1.0196571
Patient_4_AL_m0 0.8283549 0.0968785
Patient_7_AL_m0 3.1572718 0.3413900
Patient_8_AL_m0 0.1581532 0.2282131
Patient_9_AL_m0 0.2365710 4.0256386
Patient_11_AL_m0 0.2784309 0.9275250
Patient_12_AL_m0 0.4850444 0.7312003

Looking at the expression change may not necessarily be conclusive, so we can also try a simple t-test.

t.test(x = t(il13Expr$`Pre-treatment`), y=t(il13Expr$`Post-treatment`))

    Welch Two Sample t-test

data:  t(il13Expr$`Pre-treatment`) and t(il13Expr$`Post-treatment`)
t = -0.85029, df = 42.098, p-value = 0.4
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1635641  0.4736868
sample estimates:
mean of x mean of y 
0.8118375 1.1567761 

We see that the results of the test do not suggest a difference in expression before and after treatment for the entire set of patients, which provides evidence to suggest that patient variation is a major factor that needs to be accounted for.

Creating a model

To try and control for these variables, we can use various packages that provide methods specialized for differential expression analysis. In particular, edgeR is a package especially suited for bulk RNA-Seq analysis, containing several linear methods that we can use (Robinson et al., 2010). We start by setting up the counts header into individual categories, which should make it easier to use different parts of the data for different analyses or plots.

# Defines different groups that can be combined for different things (adapted from A1)
splitColNames <- function(colNames) {
  return(unlist(strsplit(colNames, split = "_"))[c(2, 3, 4)])
}
groups <- data.frame(lapply(colnames(normalizedCountData), splitColNames))
colnames(groups) <- colnames(normalizedCountData)
rownames(groups) <- c("Patient", "Sample type", "Sample time")
groups <- data.frame(t(groups))

# Check the first few lines of the whole table to make sure the groups have been defined properly
knitr::kable(groups[1:4, ], type = "html")
Patient Sample.type Sample.time
Patient_1_AN_m0 1 AN m0
Patient_1_AL_m0 1 AL m0
Patient_3_AN_m0 3 AN m0
Patient_3_AL_m0 3 AL m0

Next, we can throw (parts of) the data into an MDS plot in order to get some insight as to how we should set up our linear models.

# Parameters to select which subset of data to plot
PATIENT_START <- 1
PATIENT_END <- 5
# Get the affected subset of the data
affectedCountData <- normalizedCountData[grep(colnames(normalizedCountData), pattern = "_AL_")
                                         ][, c(PATIENT_START:PATIENT_END, (PATIENT_START + 24):(PATIENT_END + 24))]
affectedGroups <- groups[groups$Sample.type == "AL",
                         ][c(PATIENT_START:PATIENT_END, (PATIENT_START + 24):(PATIENT_END + 24)), ]

# Build the MDS plot, colouring by time (pre- and post-treatment)
dgelist <- edgeR::DGEList(counts = affectedCountData, group = affectedGroups$Sample.time)
edgeR::plotMDS.DGEList(dgelist, labels = rownames(affectedGroups),
                       col = c("darkgreen", "blue")[factor(affectedGroups$Sample.time)])
legend(x = "bottomleft", legend = c("Pre-treatment", "Post-treatment"), fill = c("darkgreen", "blue"))
title(main = "Distance between samples, coloured by treatment time")

Also try grouping by patient to see approximately how much variability there is.

# Colour by patient number
patientColours <- rainbow(PATIENT_END - PATIENT_START + 1)
edgeR::plotMDS.DGEList(dgelist, labels = rownames(affectedGroups),
                       col = patientColours)
title(main = "Distance between samples, coloured by patient number")

We see that there is no especially obvious clustering in treatment time or individual patient, which could be explained by the imperfect data; e.g., perhaps the patients shown here used different drugs or some were actually controls. Regardless, there is incentive to try fitting the data to a linear model to see if any improvements can be found.

edgeR provides a convenient quasi-likelihood method that has been recommended for bulk RNA-Seq experiments and is often used for more complicated models. Since the dataset that we are working with is quite large and complex, it may be a good candidate to use. We will first try a straightforward model that does not account for patient variability.

# Create a linear design matrix for edgeR quasi likelihood in treatment time
modelDesign <- model.matrix(~ groups$Sample.time)
dgelist <- edgeR::DGEList(counts = normalizedCountData, group = groups$Sample.time)
dgelist <- edgeR::estimateDisp(dgelist, modelDesign)
modelFit <- edgeR::glmQLFit(dgelist, modelDesign)

# Extract the results
qlfTime <- edgeR::glmQLFTest(modelFit, coef = "groups$Sample.timem3")
qlfHits <- edgeR::topTags(qlfTime, sort.by = "PValue", n = nrow(normalizedCountData))
knitr::kable(edgeR::topTags(qlfTime)[1:8, ], type = "html")
logFC logCPM F PValue FDR
CCL18 -2.5618699 6.650771 50.53415 0 3.50e-06
SPARC 1.3570281 10.440376 45.50124 0 8.60e-06
BCL6 -0.5799921 6.810072 44.66884 0 8.60e-06
CCL3-AS1 -1.8888323 1.666397 43.34914 0 1.03e-05
CCL13 -1.5892900 5.469213 40.18952 0 2.55e-05
FKBP5 -0.9104746 7.304668 39.31521 0 2.92e-05
APLNR 0.9643666 4.052355 38.77831 0 3.04e-05
NPTX2 -1.2510772 2.645831 38.09414 0 3.42e-05
x
BH
x
groups$Sample.timem3
x
glm

Number of genes that pass the threshold p-value (0.05): 2541
Number of genes that pass correction (FDR): 280

Since we are doing many tests, the probability that a positive result will occur simply by chance is increased. To control for this, we can use a method of multiple hypothesis testing, which will try to correct the positivity rate. edgeR’s quasi-likelihood method does this by using false discovery rate, a fairly common way to correct the data.

We see that a fairly large number of genes are differentially expressed, even after correction with FDR, so an even stronger threshold can be considered. This will allow for greater levels of confidence when analyzing the data, and will also put a greater emphasis on which genes can be considered to be more important because they are significantly differentially expressed.

Number of genes that pass a stricter p-value (0.01): 976
Number of genes that pass correction (FDR): 94

Even stricter thresholds could be considered (such as 0.001), but we do not want to completely throw away the weaker signal. We can next try to fit the data to a model that incorporates patient number into it, which may have better results in downstream analysis.

# Essentially the same as before, but with additional data
modelDesignPatient <- model.matrix(~ groups$Patient + groups$Sample.time)
dgelist <- edgeR::DGEList(counts = normalizedCountData, group = groups$Sample.time)
dgelist <- edgeR::estimateDisp(dgelist, modelDesignPatient)
modelFit <- edgeR::glmQLFit(dgelist, modelDesignPatient)

qlfTime <- edgeR::glmQLFTest(modelFit, coef = "groups$Sample.timem3")
qlfHitsPatient <- edgeR::topTags(qlfTime, sort.by = "PValue", n = nrow(normalizedCountData))
knitr::kable(edgeR::topTags(qlfTime[1:8, ]), type = "html")
logFC logCPM F PValue FDR
A2ML1 -0.4733038 7.791353 13.1409656 0.0005242 0.0041934
A2M 0.1952677 8.021805 4.0324669 0.0482401 0.1929606
AAAS -0.0618142 4.961146 2.6766718 0.3108878 0.4984121
A2M-AS1 0.2207703 1.697648 5.2621708 0.3623892 0.4984121
A4GALT 0.0833681 4.645910 0.7130782 0.4011140 0.4984121
AACSP1 -0.2199525 1.581059 2.9539634 0.4035144 0.4984121
AACS 0.0655970 6.777246 0.6130520 0.4361106 0.4984121
A1BG-AS1 0.0373999 1.985611 0.1546446 0.8453775 0.8453775
x
BH
x
groups$Sample.timem3
x
glm

Number of genes that pass the threshold p-value (0.05): 3089
Number of genes that pass correction (FDR): 719

Number of genes that pass a stricter p-value (0.01): 1414
Number of genes that pass correction (FDR): 267

When applying a model that takes patient variability into account, we see that the number of genes that are significantly differentially expressed increases, which is not unexpected.

Genes of interest

A quick next step is to try and find genes of interest; e.g., those that are the most differentially expressed. One method of accomplishing this task is to use a volcano plot, which puts p-value against log-fold change. In this way, genes with the most meaningful or significant changes will be found near the top-left and top-right corners of the graph. Here, the R package EnhancedVolcano is used to quickly throw together such a plot (Blighe et al., 2021).

# Get the log-fold change and p-values from the patient-variability-included model
volcano <- data.frame(log2FC = qlfTime$table$logFC, PValue = qlfTime$table$PValue)
rownames(volcano) <- rownames(normalizedCountData)
EnhancedVolcano::EnhancedVolcano(volcano, lab = rownames(volcano), x = "log2FC", y = "PValue",
                                 title = "Differentially expressed genes")

We see that the vast majority of genes (17,747 in total) are not significantly differentially expressed, being plotted toward the bottom-centre of the graph, but there are a few genes that present themselves as of interest. For example, the plot identifies CCL18, IL19, NPTX2, ADORA3, ALOX15, SPARC, ACSM6, and SPINK9 as genes with significant differences in levels of expression.

The next step is to return to looking at heatmaps; we can now plot the results of applying the quasi-likelihood model to the data, which may result in clearer clusters that can be visually identified.

# Using the stricter p-value threshold
topHits <- rownames(qlfHits$table)[qlfHits$table$PValue < 0.01]
# Warning: heatmapMatrix may only have been built with a subset of the data. topHeatmapMatrix here will reflect that
# subset. Change SAMPLE_START and SAMPLE_END in the code snippet at the start of this section to change the selected
# data
topHeatmapMatrix <- t(scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0

if (min(topHeatmapMatrix) == 0) {
  heatmapColours <- circlize::colorRamp2(c(0, max(topHeatmapMatrix)),
                                         c("white", "red"))
} else {
  heatmapColours <- circlize::colorRamp2(c(min(topHeatmapMatrix), 0, max(topHeatmapMatrix)),
                                         c("blue", "white", "red"))
}

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.type)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Type = groups$Sample.type[SAMPLE_START:SAMPLE_END]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits heatmap",
                                       top_annotation = heatmapAnno)

currHeatmap

This heatmap is not particularly useful in determining if differentially expressed genes cluster, because the columns are ordered such that individual patient data is paired together. For example, the first column here is the control sample for the first patient (the non-affected skin) pre-treatment, and the second column is the affected skin sample, also pre-treatment. It should be possible to get a better idea of clustering over time by reordering the columns.

# Build a new heatmap matrix to decouple from SAMPLE_START and SAMPLE_END
reHeatmapMatrix <- normalizedCountData
reHeatmapMatrix <- t(scale(t(reHeatmapMatrix)))
reHeatmapMatrix[is.na(reHeatmapMatrix)] <- 0

# Don't know how to do this in a one-liner
newIdxs <- integer(SAMPLE_END - SAMPLE_START + 1)
currIdx <- 1
# Add the *_AN_m0 first
for (i in seq(SAMPLE_START, SAMPLE_END, 2)) {
  newIdxs[currIdx] <- i
  currIdx <- currIdx + 1
}
# Then add the *_AN_m3
for (i in seq(SAMPLE_START, SAMPLE_END, 2)) {
  newIdxs[currIdx] <- i + 48
  currIdx <- currIdx + 1
}

topHeatmapMatrix <- t(scale(t(reHeatmapMatrix[which(rownames(reHeatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0
topHeatmapMatrix <- topHeatmapMatrix[, newIdxs]

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.time)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Time = groups$Sample.time[newIdxs]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits, control, over time",
                                       top_annotation = heatmapAnno)

currHeatmap

The above heatmap has been rearranged such that all eight columns are control (non-affected/non-lesional) skin of affected patients, with the left four columns being samples taken just prior to treatment, and the right four samples being taken after three months of treatment. Since this subset of the dataset does not contain any samples of affected (lesional) skin, it is not surprising that levels of gene expression are similar across the two categories. The most obvious difference would be with respect to patient 3 (and patient 1, to a certain extent), where the follow-up sample has greater levels of regulation–but it is still of the same type.

# Doing the same as above except with *_AL_* instead of *_AN_*
newIdxs <- integer(SAMPLE_END - SAMPLE_START + 1)
currIdx <- 1
for (i in seq(SAMPLE_START + 1, SAMPLE_END + 1, 2)) {
  newIdxs[currIdx] <- i
  currIdx <- currIdx + 1
}
for (i in seq(SAMPLE_START + 1, SAMPLE_END + 1, 2)) {
  newIdxs[currIdx] <- i + 48
  currIdx <- currIdx + 1
}

topHeatmapMatrix <- t(scale(t(reHeatmapMatrix[which(rownames(reHeatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0
topHeatmapMatrix <- topHeatmapMatrix[, newIdxs]

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.time)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Time = groups$Sample.time[newIdxs]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits, over time",
                                       top_annotation = heatmapAnno)

currHeatmap

When we look at the heatmap concerning affected skin, the differential expression is more obvious and more significant. There are many clusters of genes on the right side of the plot that correspond inversely to the same region on the left side of the plot. For example, the column of patient 7 has a large, strongly expressed group of genes near the middle of the column. However, when looking at the same area on the left side of the heatmap, the expression of the same genes is very weak in comparison. Thus, we can hypothesize that perhaps these differentially expressed genes may be responsible for AD and/or its symptoms.

3. Thresholded over-representation analysis

In this next part of the notebook, the sets of significantly up-regulated and down-regulated genes will be run through a thresholded gene set enrichment analysis. The first step is to get the lists of such genes.

# Construct the ranked gene list (non-thresholded)
qlfHitsT <- as.data.frame(qlfHitsPatient)
qlfHitsT["rank"] <- -log(qlfHitsT$PValue, base = 10) * sign(qlfHitsT$logFC)
qlfHitsT <- qlfHitsT[order(qlfHitsT$rank), ]

write.table(data.frame(genename = rownames(qlfHitsT), Fstat = qlfHitsT$rank),
            file = file.path(getwd(), "rankedGenelist.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)

# For the thresholded lists, use p = 0.05 instead of 0.01 to get more results
upregGenes <- rownames(qlfHitsT[which(qlfHitsT$PValue < 0.05 & qlfHitsT$logFC > 0), ])
write.table(upregGenes,
            file = file.path(getwd(), "upregulatedGenes.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)

downregGenes <- rownames(qlfHitsT[which(qlfHitsT$PValue < 0.05 & qlfHitsT$logFC < 0), ])
write.table(downregGenes,
            file = file.path(getwd(), "downregulatedGenes.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)

The technique that has been chosen to perform this analysis is Fisher’s exact test, done through g:Profiler (Kolberg et al., 2020). Fisher’s exact test is a common statistical method to check for overrepresentation, which is the goal of this analysis, and is able to calculate the deviation from p-values exactly. g:Profiler is a common, convenient (accessible online and through an R package) implementation of Fisher’s exact test that is updated fairly regularly, which can help increase the accuracy of the analysis. For our purpose, and in the interest of time, g:Profiler will be accessed through its website (instead of using the R package).

The annotation datasets that have been selected include GO biological processes (updated 2021-12-15) - without electronic annotations, Reactome (updated 2022-01-03), and WikiPathways (updated 2021-12-10). These three datasets are major annotation resources, and have been updated fairly recently, so it is reasonable to expect accuracte results. Other major resources such as KEGG have not been included because they may add additional extraneous information to the results, and with these three datasets, there should already be enough information for a thorough analysis.

We will start by looking at subsets of the genes of interest (i.e., those with significant differential expression), and then follow up by analyzing the entire set to compare how looking at different parts of the data may affect the results. For each run, FDR will be used for multiple hypothesis correction, with a threshold p-value of 0.05. This was chosen instead of the more stringent 0.01 to allow for more information and a broader look at the associated pathways. In addition, the resulting pathways were filtered to only include those with gene set sizes of 200 or less. This is because of the inheritance-like property of the annotation resources, which may clutter the results by including pathways that are too generic.

Upregulated genes

The first subset is all upregulated genes with significant differential expression. The list of upregulated genes (from the code snippet above) was copied to g:Profiler, the mentioned options were set (with all results checked), and the analysis was run. The result was the following figure:

Figure 1. upregulated genes only

g:Profiler has omitted titles for the plot and table, but appropriate ones would be “Over-representation of upregulated genes with p-value < 0.05” and “Highest scoring pathways for gene sets with size less than 200,” respectively.

Using the thresholds mentioned previously, GO:BP found 5,027 pathways, REAC found 1,007, and WP found 445. The plot at the top of the figure is a visual representation of adjusted p-values vs. resource, where each dot represents a specific pathway (each annotation resource has its own colour). The table at the bottom of the figure is a list of the top (approximately 15) pathways, according to adjusted p-value, for each resource. The individual pathways are given numbers, and these correspond to the labelling of the dots in the plot above. We see that despite being the pathways with the greatest adjusted p-value, these labelled dots are not the highest on the plot because of the thresholds that were used; the unlabelled dots would represent more generic pathways.

The specific pathways that were found here–for example, collagen fibril organization and several fatty acid metabolic processes–are not obviously indicative of AD, a skin disease of autoimmune origin.

Downregulated genes

The second subset is all downregulated genes with significant differential expression. The list of downregulated genes was copied to g:Profiler, options were set (this time without all results, since the list of genes was too large), and the analysis was run. The result was the following figure:

Figure 2. downregulated genes only

g:Profiler has omitted titles for the plot and table, but appropriate ones would be “Over-representation of downregulated genes with p-value < 0.05” and “Highest scoring pathways for gene sets with size less than 200,” respectively.

Using the thresholds mentioned previously, GO:BP found 230 pathways, REAC found 72, and WP found 52. The specific pathways that were found here–for example, various DNA replication pathways and cell cycle signalling–is, like the upregulated genes, not directly related to AD. However, these examples are not as broad, and it is possible that there is an indirect link. For example, psoriasis, another skin condition similar to atopic dermatitis, is directly related to the disregulation of the cell cycle of skin cells, so it would not be a surprise to see a similar result here.

All differentially expressed genes

It is important to consider the entire set of differentially expressed genes as a whole, because it is possible that the particular pathways that we are expecting have both upregulated and downregulated genes. The result of running g:Profiler on the list of combined genes is as follows:

Figure 3. all differentially expressed genes

g:Profiler has omitted titles for the plot and table, but appropriate ones would be “Over-representation of significant differentially expressed genes with p-value < 0.05” and “Highest scoring pathways for gene sets with size less than 200,” respectively.

Using the thresholds mentioned previously, GO:BP found 130 pathways, REAC found 37, and WP found 15. When we put both the upregulated genes and downregulated genes together through over-representation analysis, it appears that we get a better overall image of the pathways that AD affects. For example, the top GO:BP hit is now directly related to skin development, an obvious link to AD. In addition, we see other pathways related to immune function; for example, the Toll-like receptor 4 (TLR4) cascade, which is responsible for the production of cytokines.

Since many of these pathways were not present when only the up- or downregulated genes were considered, it is possible to conclude that AD is involved in more complex pathways that cannot be restricted to just one type of gene. The entire set of differentially expressed genes needs to be considered, and only then will the relevant pathways appear.

4. Interpretation

With regard to the differential expression analysis, several of the most significant genes that were found were also found by the authors of the original paper. For example, one with a particularly high level of expression was IL19, with the paper stating that they “detected a notable expression of IL19 … in AL,” exactly what was found here. However, this analysis did find some highly expressed genes that were not of note in the paper; for example, NPTX2, related to the C-reactive protein, was not noted in the paper. Additionally, other genes like IL31 and IL23 that were noted in the paper were not among the most expressed genes here. Thus, we can say that for this part of the notebook, the results are at least partially supported by the original paper.

When looking at pathways, the result of over-representation analysis, we see some similarities between those that were found here and those that were found in the original paper. The paper states that the “core transcriptome signature” of AD patients is characterized by epidermal differentiation and itch pathways, and gives sebaceous glands as an example. In the ORA of all differentially expressed genes, the highest scored pathway is related to skin development, of which “sebaceous gland development” is a child.

Additional pathways that were found in both include general Toll-like receptor signalling, interleukin signalling, and several other immune-related pathways. However, the paper does mention specific examples like Th17 cell differentiation and the IL-17 signalling pathway, which were not exactly found in the ORA. Thus, we should be able to conclude that for this part of the notebook, the results overall are supported by the original paper, but in order to confirm the presence of more specific pathways, more detailed examination must be done.

When considering the differentially expressed genes and pathways, another way of looking for evidence to support that the analyses in this notebook (other than comparing with the original paper) is to look for additional papers that have reached the same conclusion. One study (Oka et al., 2015) showed that IL-19, one of the most significantly expressed interleukin genes found here, was notably increased in patients with AD. Statistical analysis (primarily done via the Mann-Whitney U test) was performed, and in summary, concluded that there was statistically significant evidence to suggest that AD patients had elevated IL-19 protein levels, a reflection of significant expression.

A second paper (Panzer et al., 2014) found that AD patients also have significant levels of TLR4 expression as part of Toll-like receptor signalling pathways. Samples from lesional skin were taken, incubated with TLR4 antibodies, and then RNA was extracted, which was found to show constitutive expression of TLR4. There have even been previous studies on using dupilumab and cyclosporine to treat AD; a thid paper (Mansouri & Guttman-Yassky, 2015) specifically studied the effects of these drugs on human immune pathways, and found that they are able to block some interleukin signalling pathways, which in turn resulted in clinical improvements. Thus, this shows that there is further support for the pathways discovered by the ORA conducted here.

5. References

Blighe, K., Rana, S., & Lewis, M. (2021). EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. https://github.com/kevinblighe/EnhancedVolcano
Davis, S., & Meltzer, P. (2007). GEOquery: A bridge between the gene expression omnibus (GEO) and BioConductor. Bioinformatics, 23(14), 1846–1847. https://doi.org/10.1093/bioinformatics/btm254
Durinck, S., Spellman, P. T., Birney, E., & Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomaRt. Nat Protoc, 4(8), 1184–1191. https://doi.org/ doi: 10.1038/nprot.2009.97
Gu, Z., Eils, R., & Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32(18), 2847–2849. https://doi.org/10.1093/bioinformatics/btw313
Gu, Z., Gu, L., Eils, R., Schlesner, M., & Brors, B. (2014). Circlize implements and enhances circular visualization in r. Bioinformatics, 30(19), 2811–2812. https://doi.org/10.1093/bioinformatics/btu393
Kolberg, L., Raudvere, U., Kuzmin, I., Vilo, J., & Peterson, H. (2020). gprofiler2 – an r package for gene list functional enrichment analysis and namespace conversion toolset g:profiler. F1000Res, 9(709). https://doi.org/10.12688/f1000research.24956.2
Mansouri, Y., & Guttman-Yassky, E. (2015). Immune pathways in atopic dermatitis, and definition of biomarkers through broad and targeted therapeutics. J Clin Med, 4(5), 858–873. https://doi.org/10.3390/jcm4050858
Möbus, L., Rodriguez, E., Harder, I., Stölzl, D., Boraczynski, N., Gerdes, S., Kleinheinz, A., Abraham, S., Heratizadeh, A., Handrick, C., Haufe, E., Werfel, T., Schmitt, J., Weidinger, S., & the TREATgermany study group. (2021). Atopic dermatitis displays stable and dynamic skin transcriptome signatures. J Allergy Clin Immun, 147(1), 213–223. https://doi.org/10.1016/j.jaci.2020.06.012
Morgan, M. (2021). BiocManager: Access the bioconductor project package repository. https://CRAN.R-project.org/package=BiocManager
Oka, T., Sugaya, M., Takahashi, N., Nakajima, R., Otobe, S., Kabasawa, M., Suga, H., Miyagaki, T., Asano, Y., & Sato, S. (2015). Increased interleukin-19 expression in cutaneous t-cell lymphoma and atopic dermatitis. Acta Derm Venereol, 97(10), 1172–1177. https://doi.org/10.2340/00015555-2723
Panzer, R., Blobel, C., Fölster-Holst, R., & Proksch, E. (2014). TLR2 and TLR4 expression in atopic dermatitis, contact dermatitis and psoriasis. Exp Dermatol, 23(5), 364–366. https://doi.org/10.1111/exd.12383
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Xie, Y. (2021). Knitr: A general-purpose package for dynamic report generation in r. https://yihui.org/knitr
---
title: "BCB420 A2 - Differential gene expression and preliminary ORA"
author: "Yijia Chen"
date: "2022-03-16"
output:
  html_notebook:
    toc: true
bibliography: A2_citations.bib
csl: apa.csl
nocite: "@*"
---

```{r, message = FALSE, warning = FALSE, include = FALSE}
# Several R packages are required to run this notebook. They can be installed by running this snippet
# This snippet will not be included in the final report because the installation of EnhancedVolcano will always
# generate output that cannot be suppressed by message = FALSE or warning = FALSE

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
  BioCManager::install("GEOquery")
}
if (!requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
  BioCManager::install("edgeR")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}
if (!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}
if (!requireNamespace("EnhancedVolcano", quietly = TRUE)) {
  BiocManager::install("EnhancedVolcano", update = FALSE, ask = FALSE)
}
library(EnhancedVolcano)
```

## 1. Introduction

This notebook contains a differential gene expression analysis and a preliminary thresholded over-representation
analysis for a selected dataset from the GEO database. The dataset in particular is
[GSE157194](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157194), which is associated with a publication
titled "Atopic dermatitis displays stable and dynamic skin transcriptome signatures" [@mobus].

The goal of the study was to understand the effect of various treatments for atopic dermatitis by conducting a gene
expression study using mRNA-Seq data taken from biopsies of the affected and non-affected skin of human patients. The
dataset contains data for 57 patients, with samples of affected (intrapersonal lesional) and non-affected
(non-lesional) skin, along with follow-ups after three months of treatment (with either dupilumab or cyclosporine). An
important note is that the dataset is "incomplete"/has several prominent issues:

- There is no distinction between patients and controls (or the control patients were omitted, which is likely)
- Not all patients have samples for both affected and non-affected skin
- Not all patients have follow-up data after treatment
- No patients are labelled with which drug (dupilumab/cyclosporine) was used as treatment

No information regarding these missing features were found in the associated paper.

After downloading the raw count matrix from GEO via GEOquery [@geoquery], the data was cleaned (confirming that there
were no duplicate genes), filtered by gene (removing those without at least one read per million in each sample), and
filtered by patient (removing those with incomplete data; see the points above). The genes (rows) were labelled by
their Ensembl gene IDs, which were mapped to HUGO gene symbols using biomaRt [@biomart], and genes that were unable to
be converted (such as novel genes) were removed. To normalize the data, trimmed mean of m-values/TMM was used [@edger],
since we followed the assumption that most genes are not differentially expressed. Finally, the data was checked for
outliers (none were found), and the resulting data frame was written to a tab-delimited txt file (and gzipped for
upload to GitHub).

We can load the dataset here and take a look at the first few rows and columns:
```{r, message = FALSE, warning = FALSE}
# The dataset should originally be compressed (GitHub 25 MB)
filepath <- file.path(getwd(), "normalizedCountData.txt")
if (!file.exists(filepath)) {
  gzfilepath <- file.path(getwd(), "normalizedCountData.txt.gz")
  if (!file.exists(gzfilepath)) {
    stop("Dataset not found")
  }
  GEOquery::gunzip(gzfilepath)
}

normalizedCountData <- read.table(file = file.path(getwd(), "normalizedCountData.txt"),
                                  sep = "\t",
                                  header = TRUE,
                                  stringsAsFactors = FALSE,
                                  check.names = FALSE)
knitr::kable(normalizedCountData[1:4, 1:4], type = "html")
```

Number of patients: `r ncol(normalizedCountData) / 4`  
Number of samples: `r ncol(normalizedCountData)`  
Number of genes: `r nrow(normalizedCountData)`

The column names are formatted "Patient_\<patient number\>\_\<AN|AL\>_\<m0|m3\>", where AN means intrapersonal
non-lesional, AL means intrapersonal lesional, m0 means the sample was taken before treatment, and m3 means the sample
was taken after three months of treatment. Note the absence of patient 2 between patients 1 and 3, who was removed due
to missing data.

## 2. Differential gene expression

The corresponding journal entry for this assignment can be found
[here](https://github.com/bcb420-2022/Yijia_Chen/wiki/8.-Assignment-2-%E2%80%90-Additional-work-and-details).

First, we should inspect the data as a heatmap to get a general idea of gene expression levels. But before we do that,
we should row-normalize it, which scales rows and centres them around the mean, so that no particular gene completely
dominates the plot.

```{r, message = FALSE, warning = FALSE}
# Parameters to select the range of samples to use. The matrix has 17,747 rows and 96 columns, and if too much data is
# chosen, the code will crash
SAMPLE_START <- 1
# Setting SAMPLE_END to 12 or more has been tested to crash
SAMPLE_END <- 8
heatmapMatrix <- normalizedCountData[, SAMPLE_START:SAMPLE_END]

# Row-normalization can be done with the base scale function on the transposed matrix
heatmapMatrix <- t(scale(t(heatmapMatrix)))
heatmapMatrix[is.na(heatmapMatrix)] <- 0

if (min(heatmapMatrix) == 0) {
  # No negative values => colours should only range from white to red
  heatmapColours <- circlize::colorRamp2(c(0, max(heatmapMatrix)),
                                         c("white", "red"))
} else {
  # Has negative values => use blue for those
  heatmapColours <- circlize::colorRamp2(c(min(heatmapMatrix), 0, max(heatmapMatrix)),
                                         c("blue", "white", "red"))
}

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = FALSE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Normalized heatmap of samples")

# Displaying the heatmap may take a while to run, or just crash
currHeatmap
```

By default (and only if the last line in the code snippet directly above is not commented out), the heatmap will be
displayed for eight samples (four patients, both before treatment). The parameters at the start of the code snippet can
be changed to display different parts of the entire heatmap. Regardless, we can already see that there is significant
variation in the data between different patients. This will need to be taken into account when doing further analysis.

### Testing a single gene

The authors of the paper hypothesized that IL13 (interleukin 13, a gene related to type 2 cytokines and chemokines),
among others, were responsible for the lesions found in patients with AD, so we can run a quick test to see if there is
any change in expression for this particular gene among the entire set of patients.

```{r, message = FALSE, warning = FALSE}
# Get the columns corresponding to pre- and post-treatment affected samples
preTreatAffSamples <- grep(colnames(normalizedCountData), pattern = "_AL_m0")
postTreatAffSamples <- grep(colnames(normalizedCountData), pattern = "_AL_m3")

# Find the row index of IL13 in the data
il13Idx <- which(rownames(normalizedCountData) == "IL13")

# Format the expression levels (for both pre- and post-treatment) in a table for comparison
il13Expr <- t(normalizedCountData[il13Idx, preTreatAffSamples])
il13Expr <- as.data.frame(il13Expr)
il13Expr <- cbind(il13Expr, t(normalizedCountData[il13Idx, postTreatAffSamples]))
colnames(il13Expr) <- c("Pre-treatment", "Post-treatment")

# A visual inspection of some of the data may be interesting
knitr::kable(il13Expr[1:8, ], type = "html")
```

Looking at the expression change may not necessarily be conclusive, so we can also try a simple t-test.

```{r, message = FALSE, warning = FALSE}
t.test(x = t(il13Expr$`Pre-treatment`), y=t(il13Expr$`Post-treatment`))
```

We see that the results of the test do not suggest a difference in expression before and after treatment for the entire
set of patients, which provides evidence to suggest that patient variation is a major factor that needs to be accounted
for.

### Creating a model

To try and control for these variables, we can use various packages that provide methods specialized for differential
expression analysis. In particular, edgeR is a package especially suited for bulk RNA-Seq analysis, containing several
linear methods that we can use [@edger]. We start by setting up the counts header into individual categories, which
should make it easier to use different parts of the data for different analyses or plots.

```{r, message = FALSE, warning = FALSE}
# Defines different groups that can be combined for different things (adapted from A1)
splitColNames <- function(colNames) {
  return(unlist(strsplit(colNames, split = "_"))[c(2, 3, 4)])
}
groups <- data.frame(lapply(colnames(normalizedCountData), splitColNames))
colnames(groups) <- colnames(normalizedCountData)
rownames(groups) <- c("Patient", "Sample type", "Sample time")
groups <- data.frame(t(groups))

# Check the first few lines of the whole table to make sure the groups have been defined properly
knitr::kable(groups[1:4, ], type = "html")
```

Next, we can throw (parts of) the data into an MDS plot in order to get some insight as to how we should set up our
linear models.

```{r, message = FALSE, warning = FALSE}
# Parameters to select which subset of data to plot
PATIENT_START <- 1
PATIENT_END <- 5
# Get the affected subset of the data
affectedCountData <- normalizedCountData[grep(colnames(normalizedCountData), pattern = "_AL_")
                                         ][, c(PATIENT_START:PATIENT_END, (PATIENT_START + 24):(PATIENT_END + 24))]
affectedGroups <- groups[groups$Sample.type == "AL",
                         ][c(PATIENT_START:PATIENT_END, (PATIENT_START + 24):(PATIENT_END + 24)), ]

# Build the MDS plot, colouring by time (pre- and post-treatment)
dgelist <- edgeR::DGEList(counts = affectedCountData, group = affectedGroups$Sample.time)
edgeR::plotMDS.DGEList(dgelist, labels = rownames(affectedGroups),
                       col = c("darkgreen", "blue")[factor(affectedGroups$Sample.time)])
legend(x = "bottomleft", legend = c("Pre-treatment", "Post-treatment"), fill = c("darkgreen", "blue"))
title(main = "Distance between samples, coloured by treatment time")
```

Also try grouping by patient to see approximately how much variability there is.

```{r, message = FALSE, warning = FALSE}
# Colour by patient number
patientColours <- rainbow(PATIENT_END - PATIENT_START + 1)
edgeR::plotMDS.DGEList(dgelist, labels = rownames(affectedGroups),
                       col = patientColours)
title(main = "Distance between samples, coloured by patient number")
```

We see that there is no especially obvious clustering in treatment time or individual patient, which could be explained
by the imperfect data; e.g., perhaps the patients shown here used different drugs or some were actually controls.
Regardless, there is incentive to try fitting the data to a linear model to see if any improvements can be found.

edgeR provides a convenient quasi-likelihood method that has been recommended for bulk RNA-Seq experiments and is often
used for more complicated models. Since the dataset that we are working with is quite large and complex, it may be a
good candidate to use. We will first try a straightforward model that does not account for patient variability.

```{r, message = FALSE, warning = FALSE}
# Create a linear design matrix for edgeR quasi likelihood in treatment time
modelDesign <- model.matrix(~ groups$Sample.time)
dgelist <- edgeR::DGEList(counts = normalizedCountData, group = groups$Sample.time)
dgelist <- edgeR::estimateDisp(dgelist, modelDesign)
modelFit <- edgeR::glmQLFit(dgelist, modelDesign)

# Extract the results
qlfTime <- edgeR::glmQLFTest(modelFit, coef = "groups$Sample.timem3")
qlfHits <- edgeR::topTags(qlfTime, sort.by = "PValue", n = nrow(normalizedCountData))
knitr::kable(edgeR::topTags(qlfTime)[1:8, ], type = "html")
```

Number of genes that pass the threshold p-value (0.05): `r length(which(qlfHits$table$PValue < 0.05))`  
Number of genes that pass correction (FDR): `r length(which(qlfHits$table$FDR < 0.05))`

Since we are doing many tests, the probability that a positive result will occur simply by chance is increased. To
control for this, we can use a method of multiple hypothesis testing, which will try to correct the positivity rate.
edgeR's quasi-likelihood method does this by using false discovery rate, a fairly common way to correct the data.

We see that a fairly large number of genes are differentially expressed, even after correction with FDR, so an even
stronger threshold can be considered. This will allow for greater levels of confidence when analyzing the data, and
will also put a greater emphasis on which genes can be considered to be more important because they are significantly
differentially expressed.

Number of genes that pass a stricter p-value (0.01): `r length(which(qlfHits$table$PValue < 0.01))`  
Number of genes that pass correction (FDR): `r length(which(qlfHits$table$FDR < 0.01))`

Even stricter thresholds could be considered (such as 0.001), but we do not want to completely throw away the weaker
signal. We can next try to fit the data to a model that incorporates patient number into it, which may have better
results in downstream analysis.

```{r, message = FALSE, warning = FALSE}
# Essentially the same as before, but with additional data
modelDesignPatient <- model.matrix(~ groups$Patient + groups$Sample.time)
dgelist <- edgeR::DGEList(counts = normalizedCountData, group = groups$Sample.time)
dgelist <- edgeR::estimateDisp(dgelist, modelDesignPatient)
modelFit <- edgeR::glmQLFit(dgelist, modelDesignPatient)

qlfTime <- edgeR::glmQLFTest(modelFit, coef = "groups$Sample.timem3")
qlfHitsPatient <- edgeR::topTags(qlfTime, sort.by = "PValue", n = nrow(normalizedCountData))
knitr::kable(edgeR::topTags(qlfTime[1:8, ]), type = "html")
```

Number of genes that pass the threshold p-value (0.05): `r length(which(qlfHitsPatient$table$PValue < 0.05))`  
Number of genes that pass correction (FDR): `r length(which(qlfHitsPatient$table$FDR < 0.05))`

Number of genes that pass a stricter p-value (0.01): `r length(which(qlfHitsPatient$table$PValue < 0.01))`  
Number of genes that pass correction (FDR): `r length(which(qlfHitsPatient$table$FDR < 0.01))`

When applying a model that takes patient variability into account, we see that the number of genes that are
significantly differentially expressed increases, which is not unexpected.

### Genes of interest

A quick next step is to try and find genes of interest; e.g., those that are the most differentially expressed. One
method of accomplishing this task is to use a volcano plot, which puts p-value against log-fold change. In this way,
genes with the most meaningful or significant changes will be found near the top-left and top-right corners of the
graph. Here, the R package EnhancedVolcano is used to quickly throw together such a plot [@enhancedvolcano].

```{r, message = FALSE, warning = FALSE}
# Get the log-fold change and p-values from the patient-variability-included model
volcano <- data.frame(log2FC = qlfTime$table$logFC, PValue = qlfTime$table$PValue)
rownames(volcano) <- rownames(normalizedCountData)
EnhancedVolcano::EnhancedVolcano(volcano, lab = rownames(volcano), x = "log2FC", y = "PValue",
                                 title = "Differentially expressed genes")
```

We see that the vast majority of genes (17,747 in total) are not significantly differentially expressed, being plotted
toward the bottom-centre of the graph, but there are a few genes that present themselves as of interest. For example,
the plot identifies CCL18, IL19, NPTX2, ADORA3, ALOX15, SPARC, ACSM6, and SPINK9 as genes with significant differences
in levels of expression.

The next step is to return to looking at heatmaps; we can now plot the results of applying the quasi-likelihood model
to the data, which may result in clearer clusters that can be visually identified.

```{r, message = FALSE, warning = FALSE}
# Using the stricter p-value threshold
topHits <- rownames(qlfHits$table)[qlfHits$table$PValue < 0.01]
# Warning: heatmapMatrix may only have been built with a subset of the data. topHeatmapMatrix here will reflect that
# subset. Change SAMPLE_START and SAMPLE_END in the code snippet at the start of this section to change the selected
# data
topHeatmapMatrix <- t(scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0

if (min(topHeatmapMatrix) == 0) {
  heatmapColours <- circlize::colorRamp2(c(0, max(topHeatmapMatrix)),
                                         c("white", "red"))
} else {
  heatmapColours <- circlize::colorRamp2(c(min(topHeatmapMatrix), 0, max(topHeatmapMatrix)),
                                         c("blue", "white", "red"))
}

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.type)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Type = groups$Sample.type[SAMPLE_START:SAMPLE_END]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits heatmap",
                                       top_annotation = heatmapAnno)

currHeatmap
```

This heatmap is not particularly useful in determining if differentially expressed genes cluster, because the columns
are ordered such that individual patient data is paired together. For example, the first column here is the control
sample for the first patient (the non-affected skin) pre-treatment, and the second column is the affected skin sample,
also pre-treatment. It should be possible to get a better idea of clustering over time by reordering the columns.

```{r, message = FALSE, warning = FALSE}
# Build a new heatmap matrix to decouple from SAMPLE_START and SAMPLE_END
reHeatmapMatrix <- normalizedCountData
reHeatmapMatrix <- t(scale(t(reHeatmapMatrix)))
reHeatmapMatrix[is.na(reHeatmapMatrix)] <- 0

# Don't know how to do this in a one-liner
newIdxs <- integer(SAMPLE_END - SAMPLE_START + 1)
currIdx <- 1
# Add the *_AN_m0 first
for (i in seq(SAMPLE_START, SAMPLE_END, 2)) {
  newIdxs[currIdx] <- i
  currIdx <- currIdx + 1
}
# Then add the *_AN_m3
for (i in seq(SAMPLE_START, SAMPLE_END, 2)) {
  newIdxs[currIdx] <- i + 48
  currIdx <- currIdx + 1
}

topHeatmapMatrix <- t(scale(t(reHeatmapMatrix[which(rownames(reHeatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0
topHeatmapMatrix <- topHeatmapMatrix[, newIdxs]

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.time)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Time = groups$Sample.time[newIdxs]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits, control, over time",
                                       top_annotation = heatmapAnno)

currHeatmap
```

The above heatmap has been rearranged such that all eight columns are control (non-affected/non-lesional) skin of
affected patients, with the left four columns being samples taken just prior to treatment, and the right four samples
being taken after three months of treatment. Since this subset of the dataset does not contain any samples of affected
(lesional) skin, it is not surprising that levels of gene expression are similar across the two categories. The most
obvious difference would be with respect to patient 3 (and patient 1, to a certain extent), where the follow-up sample
has greater levels of regulation--but it is still of the same type.

```{r, message = FALSE, warning = FALSE}
# Doing the same as above except with *_AL_* instead of *_AN_*
newIdxs <- integer(SAMPLE_END - SAMPLE_START + 1)
currIdx <- 1
for (i in seq(SAMPLE_START + 1, SAMPLE_END + 1, 2)) {
  newIdxs[currIdx] <- i
  currIdx <- currIdx + 1
}
for (i in seq(SAMPLE_START + 1, SAMPLE_END + 1, 2)) {
  newIdxs[currIdx] <- i + 48
  currIdx <- currIdx + 1
}

topHeatmapMatrix <- t(scale(t(reHeatmapMatrix[which(rownames(reHeatmapMatrix) %in% topHits), ])))
topHeatmapMatrix[is.na(topHeatmapMatrix)] <- 0
topHeatmapMatrix <- topHeatmapMatrix[, newIdxs]

groupColours <- c("blue", "red")
names(groupColours) <- unique(groups$Sample.time)
heatmapAnno <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(Time = groups$Sample.time[newIdxs]),
                                                 col = list(type = groupColours))

currHeatmap <- ComplexHeatmap::Heatmap(as.matrix(topHeatmapMatrix),
                                       show_row_dend = TRUE,
                                       show_column_dend = TRUE,
                                       col = heatmapColours,
                                       show_column_names = TRUE,
                                       show_row_names = FALSE,
                                       show_heatmap_legend = TRUE,
                                       cluster_columns = FALSE,
                                       name = "Key",
                                       column_title = "Top hits, over time",
                                       top_annotation = heatmapAnno)

currHeatmap
```

When we look at the heatmap concerning affected skin, the differential expression is more obvious and more significant.
There are many clusters of genes on the right side of the plot that correspond inversely to the same region on the left
side of the plot. For example, the column of patient 7 has a large, strongly expressed group of genes near the middle
of the column. However, when looking at the same area on the left side of the heatmap, the expression of the same genes
is very weak in comparison. Thus, we can hypothesize that perhaps these differentially expressed genes may be
responsible for AD and/or its symptoms.

## 3. Thresholded over-representation analysis

In this next part of the notebook, the sets of significantly up-regulated and down-regulated genes will be run through
a thresholded gene set enrichment analysis. The first step is to get the lists of such genes.

```{r, message = FALSE, warning = FALSE}
# Construct the ranked gene list (non-thresholded)
qlfHitsT <- as.data.frame(qlfHitsPatient)
qlfHitsT["rank"] <- -log(qlfHitsT$PValue, base = 10) * sign(qlfHitsT$logFC)
qlfHitsT <- qlfHitsT[order(qlfHitsT$rank), ]

write.table(data.frame(genename = rownames(qlfHitsT), Fstat = qlfHitsT$rank),
            file = file.path(getwd(), "rankedGenelist.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)

# For the thresholded lists, use p = 0.05 instead of 0.01 to get more results
upregGenes <- rownames(qlfHitsT[which(qlfHitsT$PValue < 0.05 & qlfHitsT$logFC > 0), ])
write.table(upregGenes,
            file = file.path(getwd(), "upregulatedGenes.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)

downregGenes <- rownames(qlfHitsT[which(qlfHitsT$PValue < 0.05 & qlfHitsT$logFC < 0), ])
write.table(downregGenes,
            file = file.path(getwd(), "downregulatedGenes.txt"),
            sep = "\t",
            row.names = FALSE,
            col.names = FALSE,
            quote = FALSE)
```

The technique that has been chosen to perform this analysis is Fisher's exact test, done through g:Profiler
[@gprofiler]. Fisher's exact test is a common statistical method to check for overrepresentation, which is the goal of
this analysis, and is able to calculate the deviation from p-values exactly. g:Profiler is a common, convenient
(accessible online and through an R package) implementation of Fisher's exact test that is updated fairly regularly,
which can help increase the accuracy of the analysis. For our purpose, and in the interest of time, g:Profiler will be
accessed through its website (instead of using the R package).

The annotation datasets that have been selected include GO biological processes (updated 2021-12-15) - without
electronic annotations, Reactome (updated 2022-01-03), and WikiPathways (updated 2021-12-10). These three datasets are
major annotation resources, and have been updated fairly recently, so it is reasonable to expect accuracte results.
Other major resources such as KEGG have not been included because they may add additional extraneous information to the
results, and with these three datasets, there should already be enough information for a thorough analysis.

We will start by looking at subsets of the genes of interest (i.e., those with significant differential expression),
and then follow up by analyzing the entire set to compare how looking at different parts of the data may affect the
results. For each run, FDR will be used for multiple hypothesis correction, with a threshold p-value of 0.05. This was
chosen instead of the more stringent 0.01 to allow for more information and a broader look at the associated pathways.
In addition, the resulting pathways were filtered to only include those with gene set sizes of 200 or less. This is
because of the inheritance-like property of the annotation resources, which may clutter the results by including
pathways that are too generic.

### Upregulated genes

The first subset is all upregulated genes with significant differential expression. The list of upregulated genes (from
the code snippet above) was copied to g:Profiler, the mentioned options were set (with all results checked), and the
analysis was run. The result was the following figure:

![Figure 1. upregulated genes only](figures/gprof_up.png)

g:Profiler has omitted titles for the plot and table, but appropriate ones would be "Over-representation of upregulated
genes with p-value < 0.05" and "Highest scoring pathways for gene sets with size less than 200", respectively.

Using the thresholds mentioned previously, GO:BP found 5,027 pathways, REAC found 1,007, and WP found 445. The plot at
the top of the figure is a visual representation of adjusted p-values vs. resource, where each dot represents a
specific pathway (each annotation resource has its own colour). The table at the bottom of the figure is a list of the
top (approximately 15) pathways, according to adjusted p-value, for each resource. The individual pathways are given
numbers, and these correspond to the labelling of the dots in the plot above. We see that despite being the pathways
with the greatest adjusted p-value, these labelled dots are not the highest on the plot because of the thresholds that
were used; the unlabelled dots would represent more generic pathways.

The specific pathways that were found here--for example, collagen fibril organization and several fatty acid metabolic
processes--are not obviously indicative of AD, a skin disease of autoimmune origin.

### Downregulated genes

The second subset is all downregulated genes with significant differential expression. The list of downregulated genes
was copied to g:Profiler, options were set (this time without all results, since the list of genes was too large), and
the analysis was run. The result was the following figure:

![Figure 2. downregulated genes only](figures/gprof_down.png)

g:Profiler has omitted titles for the plot and table, but appropriate ones would be "Over-representation of
downregulated genes with p-value < 0.05" and "Highest scoring pathways for gene sets with size less than 200",
respectively.

Using the thresholds mentioned previously, GO:BP found 230 pathways, REAC found 72, and WP found 52. The specific
pathways that were found here--for example, various DNA replication pathways and cell cycle signalling--is, like the
upregulated genes, not directly related to AD. However, these examples are not as broad, and it is possible that there
is an indirect link. For example, psoriasis, another skin condition similar to atopic dermatitis, is directly related
to the disregulation of the cell cycle of skin cells, so it would not be a surprise to see a similar result here.

### All differentially expressed genes

It is important to consider the entire set of differentially expressed genes as a whole, because it is possible that
the particular pathways that we are expecting have both upregulated and downregulated genes. The result of running
g:Profiler on the list of combined genes is as follows:

![Figure 3. all differentially expressed genes](figures/gprof_de.png)

g:Profiler has omitted titles for the plot and table, but appropriate ones would be "Over-representation of significant
differentially expressed genes with p-value < 0.05" and "Highest scoring pathways for gene sets with size less than
200", respectively.

Using the thresholds mentioned previously, GO:BP found 130 pathways, REAC found 37, and WP found 15. When we put both
the upregulated genes and downregulated genes together through over-representation analysis, it appears that we get a
better overall image of the pathways that AD affects. For example, the top GO:BP hit is now directly related to skin
development, an obvious link to AD. In addition, we see other pathways related to immune function; for example, the
Toll-like receptor 4 (TLR4) cascade, which is responsible for the production of cytokines.

Since many of these pathways were not present when only the up- or downregulated genes were considered, it is possible
to conclude that AD is involved in more complex pathways that cannot be restricted to just one type of gene. The entire
set of differentially expressed genes needs to be considered, and only then will the relevant pathways appear.

## 4. Interpretation

With regard to the differential expression analysis, several of the most significant genes that were found were also
found by the authors of the original paper. For example, one with a particularly high level of expression was IL19,
with the paper stating that they "detected a notable expression of IL19 ... in AL", exactly what was found here.
However, this analysis did find some highly expressed genes that were not of note in the paper; for example, NPTX2,
related to the C-reactive protein, was not noted in the paper. Additionally, other genes like IL31 and IL23 that were
noted in the paper were not among the most expressed genes here. Thus, we can say that for this part of the notebook,
the results are at least partially supported by the original paper.

When looking at pathways, the result of over-representation analysis, we see some similarities between those that were
found here and those that were found in the original paper. The paper states that the "core transcriptome signature" of
AD patients is characterized by epidermal differentiation and itch pathways, and gives sebaceous glands as an example.
In the ORA of all differentially expressed genes, the highest scored pathway is related to skin development, of which
"sebaceous gland development" is a child.

Additional pathways that were found in both include general Toll-like receptor signalling, interleukin signalling, and
several other immune-related pathways. However, the paper does mention specific examples like Th17 cell differentiation
and the IL-17 signalling pathway, which were not exactly found in the ORA. Thus, we should be able to conclude that for
this part of the notebook, the results overall are supported by the original paper, but in order to confirm the
presence of more specific pathways, more detailed examination must be done.

When considering the differentially expressed genes and pathways, another way of looking for evidence to support that
the analyses in this notebook (other than comparing with the original paper) is to look for additional papers that
have reached the same conclusion. One study [@oka] showed that IL-19, one of the most significantly expressed
interleukin genes found here, was notably increased in patients with AD. Statistical analysis (primarily done via the
Mann-Whitney U test) was performed, and in summary, concluded that there was statistically significant evidence to
suggest that AD patients had elevated IL-19 protein levels, a reflection of significant expression.

A second paper [@panzer] found that AD patients also have significant levels of TLR4 expression as part of
Toll-like receptor signalling pathways. Samples from lesional skin were taken, incubated with TLR4 antibodies, and then
RNA was extracted, which was found to show constitutive expression of TLR4. There have even been previous studies on
using dupilumab and cyclosporine to treat AD; a thid paper [@mansouri] specifically studied the effects of these drugs
on human immune pathways, and found that they are able to block some interleukin signalling pathways, which in turn
resulted in clinical improvements. Thus, this shows that there is further support for the pathways discovered by the
ORA conducted here.

## 5. References
